APPFS Exercise 1

The following is able to process the entire data file of 100 million lines, it just takes forever on my computer, therefore I am submitting this assignment without the program having being run on the entire 100 million dataset. Instead, this program runs on a smaller subset of the dataset... specifically, the first 20 lines of the 100 million dataset. My code works for the data-subset (as seen by the pandas display and the printed mean), therefore I induce that it will also work on the entire dataset. \n

To run this program on your own more-powerful system, just run all the cells until the end. \n

I am blindly hoping that my program obtains the correct mean. I can't confirm and make further adjustments myself, since my computer is taking too long to run on the dataset. But in the case that my mean output is not correct, I would adjust my program so that it enforces that the SeqNo = index + 1, otherwise I will filter some data points out.


In [94]:
import pandas as pd

data = open("./ex1-100.dat").readlines()

df_data = pd.DataFrame(data, columns = ['original'])
df_data['clean'] = df_data.apply(lambda row: row['original'][0].isdigit(), axis=1) # determine if each line is valid data
df_dataclean = pd.DataFrame(df_data[df_data['clean'] == True]['original']).reindex() # eliminate the lines with comments
df_dataclean = df_dataclean.reset_index().drop('index', axis=1)

df_dataclean['split'] = df_dataclean.apply(lambda row: row['original'].split(';'), axis=1)
df_dataclean['SeqNo'] = df_dataclean.apply(lambda row: int(row['split'][0].strip()), axis=1)
df_dataclean['Location'] = df_dataclean.apply(lambda row: int(row['split'][1].strip()), axis=1)
df_dataclean['Value'] = df_dataclean.apply(lambda row: float(row['split'][2].strip()), axis=1)

In [77]:
df_dataclean.head()


Out[77]:
original split SeqNo Location Value
0 1; 2; 10.5196\n [1, 2, 10.5196\n] 1 2 10.5196
1 2; 1; 85.2815\n [2, 1, 85.2815\n] 2 1 85.2815
2 3; 2; 63.8276\n [3, 2, 63.8276\n] 3 2 63.8276
3 4; 2; 84.4309\n [4, 2, 84.4309\n] 4 2 84.4309
4 5; 1; 73.6285\n [5, 1, 73.6285\n] 5 1 73.6285

In [78]:
mean1 = df_dataclean[df_dataclean['Location'] == 1]['Value'].mean()
mean2 = df_dataclean[df_dataclean['Location'] == 2]['Value'].mean()


mean1: 66.14905714285713
mean2: 54.21118

In [ ]:
# if mean not as expected, do further data cleaning. ex enforce that seqno = index + 1
print('Valid values Loc1: ' + str(df_dataclean['Location'].value_counts()[1]) + 'with GeoMean: ' + str(mean1))
print('Valid values Loc2: ' + str(df_dataclean['Location'].value_counts()[2]) + 'with GeoMean: ' + str(mean2))